
    IST Austria Thesis

    The scalability of concurrent data structures and distributed algorithms strongly depends on reducing contention for shared resources and the costs of synchronization and communication. We show how such cost reductions can be attained by relaxing the strict consistency conditions required by sequential implementations. In the first part of the thesis, we consider relaxation in the context of concurrent data structures. Specifically, in data structures such as priority queues, imposing strong semantics renders scalability impossible, since a correct implementation of the remove operation should return only the element with the highest priority. Intuitively, attempting to invoke remove operations concurrently creates a race condition. This bottleneck can be circumvented by relaxing the semantics of the affected data structure, allowing removal of elements that are no longer required to have the highest priority. We show that randomized implementations of relaxed data structures provide provable guarantees on the priority of the removed elements even under concurrency. Additionally, we show that in some cases relaxed data structures can be used to scale classical algorithms that are usually implemented with exact ones. In the second part, we study parallel variants of the stochastic gradient descent (SGD) algorithm, which distribute computation among multiple processors, thus reducing the running time. Unfortunately, for standard parallel SGD to succeed, each processor has to maintain a local copy of the model parameters that is identical to the local copies of the other processors; the communication and synchronization overheads of this perfect consistency can negate the speedup gained by distributing the computation. We show that the consistency conditions required by SGD can be relaxed, allowing the algorithm to tolerate quantized communication, asynchrony, or even crash faults, while its convergence remains asymptotically the same.
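
    As a minimal sketch of the relaxation idea from the first part, consider a priority queue whose remove() may return any of the k smallest keys rather than strictly the smallest. The class below is illustrative only; the thesis's actual implementations shard elements across concurrent queues and give randomized rank guarantees.

        import heapq
        import random

        class KRelaxedPriorityQueue:
            """Toy k-relaxed priority queue: remove() may return any of
            the k smallest keys, not strictly the smallest. A concurrent
            implementation would shard elements across threads; this
            sequential sketch only illustrates the relaxed semantics."""

            def __init__(self, k):
                self.k = k
                self.heap = []

            def insert(self, key, item):
                heapq.heappush(self.heap, (key, item))

            def remove(self):
                if not self.heap:
                    raise IndexError("remove from empty queue")
                # Pop up to k candidates, return one at random, reinsert the rest.
                count = min(self.k, len(self.heap))
                candidates = [heapq.heappop(self.heap) for _ in range(count)]
                chosen = candidates.pop(random.randrange(count))
                for c in candidates:
                    heapq.heappush(self.heap, c)
                return chosen

        q = KRelaxedPriorityQueue(k=4)
        for p in random.sample(range(100), 20):
            q.insert(p, f"task-{p}")
        print(q.remove())  # one of the 4 smallest keys, i.e., highest priorities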

    Relaxed Schedulers Can Efficiently Parallelize Iterative Algorithms

    There has been significant progress in understanding the parallelism inherent to iterative sequential algorithms: for many classic algorithms, the depth of the dependence structure is now well understood, and scheduling techniques have been developed to exploit this shallow dependence structure for efficient parallel implementations. A related, applied research strand has studied methods by which certain iterative task-based algorithms can be efficiently parallelized via relaxed concurrent priority schedulers. These allow for high concurrency when inserting and removing tasks, at the cost of executing superfluous work due to the relaxed semantics of the scheduler. In this work, we take a step towards unifying these two research directions, by showing that there exists a family of relaxed priority schedulers that can efficiently and deterministically execute classic iterative algorithms such as greedy maximal independent set (MIS) and matching. Our primary result shows that, given a randomized scheduler with an expected relaxation factor of $k$ in terms of the maximum allowed priority inversions on a task, and any graph on $n$ vertices, the scheduler is able to execute greedy MIS with only an additive factor of poly($k$) expected additional iterations compared to an exact (but not scalable) scheduler. This counter-intuitive result demonstrates that the overhead of relaxation when computing MIS does not depend on the input size or the structure of the input graph. Experimental results show that this overhead can be clearly offset by the gain in performance due to the highly scalable scheduler. In sum, we present an efficient method to deterministically parallelize iterative sequential algorithms, with provable runtime guarantees in terms of the number of executed tasks to completion. (PODC 2018, pages 377-386 in the proceedings.)
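
    The setup can be made concrete with a toy simulation (not the paper's schedulers): sequential greedy MIS driven through a k-relaxed queue that executes one of the k highest-priority pending vertices per step, counting out-of-order pops as wasted iterations. The computed MIS always matches the exact sequential one; only the wasted work varies.

        import heapq
        import random

        def greedy_mis_relaxed(n_vertices, edges, k, seed=0):
            # Adjacency sets; a vertex's id doubles as its (static) priority.
            adj = {v: set() for v in range(n_vertices)}
            for u, v in edges:
                adj[u].add(v)
                adj[v].add(u)
            rng = random.Random(seed)
            heap = list(range(n_vertices))
            heapq.heapify(heap)
            decided, mis, wasted = set(), set(), 0
            while heap:
                # k-relaxed remove: run a random task among the top-k pending ones.
                top = [heapq.heappop(heap) for _ in range(min(k, len(heap)))]
                v = top.pop(rng.randrange(len(top)))
                for t in top:
                    heapq.heappush(heap, t)
                if any(u < v and u not in decided for u in adj[v]):
                    # Priority inversion: a higher-priority neighbor is still
                    # pending, so this step is wasted and v is retried later.
                    heapq.heappush(heap, v)
                    wasted += 1
                    continue
                decided.add(v)
                if not adj[v] & mis:
                    mis.add(v)
            return mis, wasted

        mis, wasted = greedy_mis_relaxed(8, [(0, 1), (1, 2), (2, 3), (3, 4)], k=3)
        print(sorted(mis), wasted)  # MIS matches the exact sequential greedy run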

    The Power of Choice in Priority Scheduling

    Consider the following random process: we are given $n$ queues, into which elements of increasing labels are inserted uniformly at random. To remove an element, we pick two queues at random, and remove the element of lower label (higher priority) among the two. The cost of a removal is the rank of the label removed, among labels still present in any of the queues, that is, the distance from the optimal choice at each step. Variants of this strategy are prevalent in state-of-the-art concurrent priority queue implementations. Nonetheless, it is not known whether such implementations provide any rank guarantees, even in a sequential model. We answer this question, showing that this strategy provides surprisingly strong guarantees: although the single-choice process, where we always insert and remove from a single randomly chosen queue, has degrading cost, going to infinity as we increase the number of steps, in the two-choice process the expected rank of a removed element is $O(n)$ while the expected worst-case cost is $O(n \log n)$. These bounds are tight, and hold irrespective of the number of steps for which we run the process. The argument is based on a new technical connection between "heavily loaded" balls-into-bins processes and priority scheduling. Our analytic results inspire a new concurrent priority queue implementation, which improves upon the state of the art in terms of practical performance.
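
    A short simulation of the process (parameter names are mine) makes the guarantee easy to check empirically: the average rank cost stays bounded as the number of steps grows, consistent with the $O(n)$ expectation.

        import random

        def two_choice_rank_cost(n_queues, steps, seed=0):
            rng = random.Random(seed)
            queues = [[] for _ in range(n_queues)]
            next_label, costs = 0, []
            for _ in range(steps):
                # Insert: labels increase over time, so each queue stays sorted.
                queues[rng.randrange(n_queues)].append(next_label)
                next_label += 1
                # Remove: pick two random queues, take the lower front label.
                picks = [q for q in (queues[rng.randrange(n_queues)],
                                     queues[rng.randrange(n_queues)]) if q]
                if not picks:
                    continue
                best = min(picks, key=lambda q: q[0])
                # Cost = rank of the removed label among all labels still present.
                all_labels = sorted(x for q in queues for x in q)
                costs.append(all_labels.index(best[0]))
                best.pop(0)
            return sum(costs) / max(len(costs), 1)

        print(two_choice_rank_cost(n_queues=16, steps=5000))  # stays O(n)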

    The Transactional Conflict Problem

    The transactional conflict problem arises in transactional systems whenever two or more concurrent transactions clash on a data item. While the standard solution to such conflicts is to immediately abort one of the transactions, some practical systems consider the alternative of delaying conflict resolution for a short interval, which may allow one of the transactions to commit. The challenge in the transactional conflict problem is to choose the optimal length of this delay interval so as to minimize the overall running-time penalty for the conflicting transactions. In this paper, we propose a family of optimal online algorithms for the transactional conflict problem. Specifically, we consider variants of this problem which arise in different implementations of transactional systems, namely "requestor wins" and "requestor aborts" implementations: in the former, the recipient of a coherence request is aborted, whereas in the latter, it is the requestor which has to abort. Both strategies are implemented by real systems. We show that the requestor-aborts case can be reduced to a classic instance of the ski-rental problem, while the requestor-wins case leads to a new version of this classical problem, for which we derive optimal deterministic and randomized algorithms. Moreover, we prove that, under a simplified adversarial model, our algorithms are constant-competitive with the offline optimum in terms of throughput. We validate our algorithmic results empirically through a simulation of hardware transactional memory (HTM), showing that our algorithms can lead to non-trivial performance improvements for classic concurrent data structures.
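
    For intuition on the ski-rental connection, the classic deterministic 2-competitive rule can be phrased for conflict delays as follows. This is a sketch under simplified cost assumptions, not the paper's exact algorithm, and other_committed is a hypothetical predicate reporting whether the conflicting transaction has finished.

        def resolve_conflict(abort_cost, wait_step, other_committed):
            """Ski-rental rule: keep waiting ("renting") while accumulated
            waiting time is below the abort penalty; once it would exceed
            the penalty, abort and retry ("buy"). Deterministic ski rental
            is 2-competitive against the offline optimum."""
            waited = 0.0
            while waited + wait_step <= abort_cost:
                if other_committed():
                    return "commit", waited          # waiting paid off
                waited += wait_step                  # pay one unit of rent
            return "abort", waited + abort_cost      # give up, pay the penalty

        # Hypothetical usage: the other transaction commits after 3 time units.
        clock = {"t": 0.0}
        def other_committed():
            clock["t"] += 1.0
            return clock["t"] > 3.0

        print(resolve_conflict(abort_cost=10.0, wait_step=1.0,
                               other_committed=other_committed))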

    Dynamic Averaging Load Balancing on Cycles

    We consider the following dynamic load-balancing process: given an underlying graph $G$ with $n$ nodes, in each step $t \geq 0$, one unit of load is created, and placed at a randomly chosen graph node. In the same step, the chosen node picks a random neighbor, and the two nodes balance their loads by averaging them. We are interested in the expected gap between the minimum and maximum loads at nodes as the process progresses, and its dependence on $n$ and on the graph structure. Similar variants of the above graphical balanced allocation process have been studied by Peres, Talwar, and Wieder, and by Sauerwald and Sun for regular graphs. These authors left as open the question of characterizing the gap in the case of \emph{cycle graphs} in the \emph{dynamic} case, where weights are created during the algorithm's execution. For this case, the only known upper bound is of $\mathcal{O}(n \log n)$, following from a majorization argument due to Peres, Talwar, and Wieder, which analyzes a related graphical allocation process. In this paper, we provide an upper bound of $\mathcal{O}(\sqrt{n} \log n)$ on the expected gap of the above process for cycles of length $n$. We introduce a new potential analysis technique, which enables us to bound the difference in load between $k$-hop neighbors on the cycle, for any $k \leq n/2$. We complement this with a "gap covering" argument, which bounds the maximum value of the gap by bounding its value across all possible subsets of a certain structure, and recursively bounding the gaps within each subset. We provide analytical and experimental evidence that our upper bound on the gap is tight up to a logarithmic factor.
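
    The process itself is easy to simulate directly (variable names are mine), which is one way to compare the observed gap on a cycle against the $\mathcal{O}(\sqrt{n} \log n)$ bound.

        import random

        def cycle_gap(n, steps, seed=0):
            """Simulate dynamic averaging load balancing on a cycle of n
            nodes: each step places one unit of load on a random node,
            which then averages its load with a random neighbor. Returns
            the final gap between the maximum and minimum loads."""
            rng = random.Random(seed)
            load = [0.0] * n
            for _ in range(steps):
                i = rng.randrange(n)
                load[i] += 1.0
                j = (i + rng.choice((-1, 1))) % n   # random cycle neighbor
                avg = (load[i] + load[j]) / 2.0
                load[i] = load[j] = avg
            return max(load) - min(load)

        print(cycle_gap(n=100, steps=200_000))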

    Influence of Tribological Parameters on the Railway Wheel Derailment

    At present, Nadal's formula, which contains only a limited number of parameters, is used to predict derailment. In addition, the laws of variation of these parameters are insufficiently studied, and the influence of further parameters on derailment is unknown, which complicates the solution of the problem. The sliding distance and the relative sliding velocity are the factors that contribute most to the destruction of the third body. Moreover, an increased friction coefficient between the steering surfaces of the wheel and rail promotes climbing of the wheel onto the rail, and hence derailment. This work presents the dependences of the main parameters influencing the destruction of the third body (the sliding distance and the relative sliding velocity) on the track curvature, on the difference in diameters of the two wheels of a wheelset, and on the non-roundness of one of those wheels. We propose methods for estimating the degree of destruction of the third body, and an extension of Nadal's formula with an additional criterion ruling out rolling of the wheel at the contact point of the wheel and rail steering surfaces; this criterion contains the advance of that contact point, which in turn depends on the angle of attack.
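
    For reference, the classical Nadal single-wheel criterion that the abstract builds on limits the lateral-to-vertical force ratio to L/V = (tan(delta) - mu) / (1 + mu tan(delta)), where delta is the flange contact angle and mu the wheel/rail friction coefficient. The sketch below merely evaluates this known formula; the paper's proposed additional criterion is not reproduced here.

        import math

        def nadal_limit(flange_angle_deg, mu):
            """Nadal's single-wheel L/V derailment criterion:
            L/V = (tan(delta) - mu) / (1 + mu * tan(delta)).
            Note how increasing mu lowers the safe L/V ratio, matching
            the abstract's point that higher flange friction promotes
            wheel climb."""
            t = math.tan(math.radians(flange_angle_deg))
            return (t - mu) / (1 + mu * t)

        for mu in (0.1, 0.3, 0.5):
            print(f"mu={mu:.1f}: L/V limit = {nadal_limit(70.0, mu):.3f}")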

    Lower Bounds for Shared-Memory Leader Election Under Bounded Write Contention

    This paper gives tight logarithmic lower bounds on the solo step complexity of leader election in an asynchronous shared-memory model with single-writer multi-reader (SWMR) registers, for both deterministic and randomized obstruction-free algorithms. The approach extends to lower bounds for deterministic and randomized obstruction-free algorithms using multi-writer registers under bounded write concurrency, showing a trade-off between the solo step complexity of a leader election algorithm and the worst-case number of stalls incurred by a processor in an execution.

    Communication-Efficient Federated Learning With Data and Client Heterogeneity

    Federated Learning (FL) enables large-scale distributed training of machine learning models, while still allowing individual nodes to maintain data locally. However, executing FL at scale comes with inherent practical challenges: 1) heterogeneity of the local node data distributions, 2) heterogeneity of node computational speeds (asynchrony), and 3) constraints on the amount of communication between the clients and the server. In this work, we present the first variant of the classic federated averaging (FedAvg) algorithm which, at the same time, supports data heterogeneity, partial client asynchrony, and communication compression. Our algorithm comes with a rigorous analysis showing that, in spite of these system relaxations, it can provide similar convergence to FedAvg in interesting parameter regimes. Experimental results on the rigorous LEAF benchmark, on setups of up to $300$ nodes, show that our algorithm ensures fast convergence for standard federated tasks, improving upon prior quantized and asynchronous approaches.
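
    A hedged sketch of the compression ingredient: a generic unbiased stochastic quantizer applied to client updates inside a FedAvg-style averaging step. This illustrates the general pattern only; the paper's exact quantizer, asynchrony handling, and server logic may differ.

        import numpy as np

        def quantize(delta, levels=16, rng=None):
            """Uniform stochastic quantization of a model update: an
            unbiased, generic compressor (E[output] == delta), not
            necessarily the paper's exact scheme."""
            rng = rng or np.random.default_rng(0)
            scale = np.max(np.abs(delta))
            if scale == 0:
                return delta
            norm = delta / scale * levels                  # in [-levels, levels]
            low = np.floor(norm)
            q = low + (rng.random(delta.shape) < (norm - low))  # stochastic round
            return q * scale / levels

        def fedavg_round(global_model, client_grads, lr=0.1, seed=0):
            """One FedAvg-style round where each client sends a quantized
            local update and the server averages them (a sketch of the
            general pattern under these assumptions)."""
            rng = np.random.default_rng(seed)
            updates = [quantize(-lr * g, rng=rng) for g in client_grads]
            return global_model + np.mean(updates, axis=0)

        model = np.zeros(4)
        grads = [np.array([1.0, -2.0, 0.5, 0.0]),
                 np.array([0.5, -1.0, 1.5, -0.5])]
        print(fedavg_round(model, grads))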

    Provably-Efficient and Internally-Deterministic Parallel Union-Find

    Determining the degree of inherent parallelism in classical sequential algorithms and leveraging it for fast parallel execution is a key topic in parallel computing, and detailed analyses are known for a wide range of classical algorithms. In this paper, we perform the first such analysis for the fundamental Union-Find problem, in which we are given a graph as a sequence of edges, and must maintain its connectivity structure under edge additions. We prove that classic sequential algorithms for this problem are well-parallelizable under reasonable assumptions, addressing a conjecture by [Blelloch, 2017]. More precisely, we show via a new potential argument that, under uniform random edge ordering, parallel union-find operations are unlikely to interfere: $T$ concurrent threads processing the graph in parallel will encounter memory contention $O(T^2 \cdot \log |V| \cdot \log |E|)$ times in expectation, where $|E|$ and $|V|$ are the number of edges and nodes in the graph, respectively. We leverage this result to design a new parallel Union-Find algorithm that is not only internally deterministic, i.e., its results are guaranteed to match those of a sequential execution, but also work-efficient and scalable, as long as the number of threads $T$ is $O(|E|^{\frac{1}{3} - \varepsilon})$, for an arbitrarily small constant $\varepsilon > 0$, which holds for most large real-world graphs. We present lower bounds which show that our analysis is close to optimal, and experimental results suggesting that the performance cost of internal determinism is limited.
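
    For context, the sequential core under analysis is the classic union-find structure with path compression and union by rank, sketched below. The paper's contribution is the contention analysis and the internally deterministic parallel algorithm built on top of this core, which is not reproduced here.

        class UnionFind:
            """Classic sequential union-find: path compression (here,
            path halving) in find() and union by rank in union()."""

            def __init__(self, n):
                self.parent = list(range(n))
                self.rank = [0] * n

            def find(self, x):
                while self.parent[x] != x:
                    self.parent[x] = self.parent[self.parent[x]]  # halve path
                    x = self.parent[x]
                return x

            def union(self, x, y):
                rx, ry = self.find(x), self.find(y)
                if rx == ry:
                    return False      # edge closes a cycle; no merge needed
                if self.rank[rx] < self.rank[ry]:
                    rx, ry = ry, rx
                self.parent[ry] = rx  # attach shorter tree under taller one
                if self.rank[rx] == self.rank[ry]:
                    self.rank[rx] += 1
                return True

        uf = UnionFind(5)
        for e in [(0, 1), (1, 2), (3, 4)]:
            uf.union(*e)
        print(uf.find(0) == uf.find(2), uf.find(0) == uf.find(3))  # True False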